2025-05-02-16-25
Urban Air Mobility as a System of Systems: An LLM-Enhanced Holonic Approach
Abstract
arXiv:2505.00368v1 Announce Type: new Abstract: Urban Air Mobility (UAM) is an emerging System of Systems (SoS) that faces challenges in system architecture, planning, task management, and execution. Traditional architectural approaches struggle with scalability, adaptability, and seamless resource integration within dynamic and complex environments. This paper presents an intelligent holonic architecture that incorporates Large Language Models (LLMs) to manage the complexities of UAM. Holons function semi-autonomously, allowing for real-time coordination among air taxis, ground transport, and vertiports. LLMs process natural language inputs, generate adaptive plans, and manage disruptions such as weather changes or airspace closures. Through a case study of multimodal transportation with electric scooters and air taxis, we demonstrate how this architecture enables dynamic resource allocation, real-time replanning, and autonomous adaptation without centralized control, creating more resilient and efficient urban transportation networks. By advancing decentralized control and AI-driven adaptability, this work lays the groundwork for resilient, human-centric UAM ecosystems, with future efforts targeting hybrid AI integration and real-world validation.
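Summary
Urban Air Mobility (UAM), as an emerging System of Systems (SoS), faces numerous challenges in system architecture, planning, task management, and execution. Traditional architectural approaches struggle to achieve scalability, adaptability, and seamless resource integration in dynamic, complex environments. This paper proposes an intelligent holonic architecture that incorporates large language models (LLMs) to handle the complexity of UAM. Holons operate semi-autonomously, enabling real-time coordination among air taxis, ground transport, and vertiports. LLMs process natural language inputs to generate adaptive plans and manage disruptions such as weather changes and airspace closures. Through a case study of multimodal transportation with electric scooters and air taxis, we verify how the architecture achieves dynamic resource allocation, real-time replanning, and autonomous adaptation without relying on centralized control, yielding more resilient and efficient urban transportation networks. By advancing decentralized control and AI-driven adaptability, this work lays the foundation for resilient, human-centric UAM ecosystems; future work will focus on hybrid AI integration and real-world validation.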
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Abstract
arXiv:2505.00212v1 Announce Type: new Abstract: Failure attribution in LLM multi-agent systems, that is, identifying the agent and step responsible for task failures, provides crucial clues for system debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution
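Summary
Failure attribution in LLM multi-agent systems, that is, identifying the agent and the step responsible for a task failure, provides crucial clues for system debugging, but the area remains underexplored and dependent on manual effort. This paper proposes and formulates a new research direction: automated failure attribution for LLM multi-agent systems. To support this research, we release the Who&When dataset, which contains detailed failure logs from 127 LLM multi-agent systems, annotated with the specific agents and decisive error steps linked to each failure. Using Who&When, we develop and evaluate three automated failure attribution methods and summarize their respective strengths and weaknesses. The best method reaches 53.5% accuracy in identifying the responsible agent but only 14.2% in locating the failure step, and some methods perform below random. Even state-of-the-art reasoning models such as OpenAI o1 and DeepSeek R1 fall short of practical usability. These results reveal the complexity of the task and underline the need for further research. Code and dataset: https://github.com/mingyin1/Agents_Failure_Attribution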
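The two accuracy figures in this abstract (agent-level and step-level) can be illustrated with a minimal Python sketch. The `Attribution` record, the agent names, and the choice to score the step independently of the agent are all assumptions made for illustration; the abstract does not specify the dataset schema or the exact scoring rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Hypothetical record: who is blamed, and at which step of the log."""
    agent: str
    step: int


def attribution_accuracy(predictions, labels):
    """Return (agent_accuracy, step_accuracy) over paired predictions and labels.

    Assumption: the step is scored on its own, regardless of whether the
    predicted agent is correct.
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need equally sized, non-empty lists")
    agent_hits = sum(p.agent == g.agent for p, g in zip(predictions, labels))
    step_hits = sum(p.step == g.step for p, g in zip(predictions, labels))
    n = len(labels)
    return agent_hits / n, step_hits / n


# Toy data: every agent is identified correctly, but only one step is.
labels = [Attribution("planner", 3), Attribution("coder", 7), Attribution("critic", 2)]
preds = [Attribution("planner", 4), Attribution("coder", 7), Attribution("critic", 5)]
agent_acc, step_acc = attribution_accuracy(preds, labels)
```

The toy data mirrors the qualitative pattern reported above: naming the responsible agent is easier than pinpointing the exact step.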
UserCentrix: An Agentic Memory-augmented AI Framework for Smart Spaces
Abstract
arXiv:2505.00472v1 Announce Type: new Abstract: Agentic AI, with its autonomous and proactive decision-making, has transformed smart environments. By integrating Generative AI (GenAI) and multi-agent systems, modern AI frameworks can dynamically adapt to user preferences, optimize data management, and improve resource allocation. This paper introduces UserCentrix, an agentic memory-augmented AI framework designed to enhance smart spaces through dynamic, context-aware decision-making. This framework integrates personalized Large Language Model (LLM) agents that leverage user preferences and LLM memory management to deliver proactive and adaptive assistance. Furthermore, it incorporates a hybrid hierarchical control system, balancing centralized and distributed processing to optimize real-time responsiveness while maintaining global situational awareness. UserCentrix achieves resource-efficient AI interactions by embedding memory-augmented reasoning, cooperative agent negotiation, and adaptive orchestration strategies. Our key contributions include (i) a self-organizing framework with proactive scaling based on task urgency, (ii) a Value of Information (VoI)-driven decision-making process, (iii) a meta-reasoning personal LLM agent, and (iv) an intelligent multi-agent coordination system for seamless environment adaptation. Experimental results across various models confirm the effectiveness of our approach in enhancing response accuracy, system efficiency, and computational resource management in real-world application.
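Summary
Agentic AI, with its capacity for autonomous decision-making, is transforming smart environments. By integrating Generative AI (GenAI) and multi-agent systems, modern AI frameworks can dynamically adapt to user preferences, optimize data management, and improve resource allocation. This paper presents UserCentrix, a memory-augmented agentic AI framework that aims to improve smart spaces through dynamic, context-aware decision-making. The framework integrates personalized large language model (LLM) agents that use user preferences and LLM memory management to provide proactive, adaptive assistance. In addition, the system adopts a hybrid hierarchical control architecture that balances centralized and distributed processing to optimize real-time responsiveness while maintaining global situational awareness. By embedding memory-augmented reasoning, cooperative agent negotiation, and adaptive orchestration strategies, UserCentrix achieves resource-efficient AI interactions. Our key contributions are: (i) a self-organizing framework with proactive scaling based on task urgency; (ii) a Value of Information (VoI)-driven decision-making process; (iii) a personal LLM agent with meta-reasoning capability; and (iv) an intelligent multi-agent coordination system that supports seamless environment adaptation. Experiments across multiple models show that the approach is effective at improving response accuracy, system efficiency, and computational resource management in real-world applications.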
RAIL in the Wild: Operationalizing Responsible AI Evaluation Using Anthropic's Value Dataset
Abstract
arXiv:2505.00204v1 Announce Type: new Abstract: As AI systems become embedded in real-world applications, ensuring they meet ethical standards is crucial. While existing AI ethics frameworks emphasize fairness, transparency, and accountability, they often lack actionable evaluation methods. This paper introduces a systematic approach using the Responsible AI Labs (RAIL) framework, which includes eight measurable dimensions to assess the normative behavior of large language models (LLMs). We apply this framework to Anthropic's "Values in the Wild" dataset, containing over 308,000 anonymized conversations with Claude and more than 3,000 annotated value expressions. Our study maps these values to RAIL dimensions, computes synthetic scores, and provides insights into the ethical behavior of LLMs in real-world use.
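Summary
As AI systems become increasingly embedded in real-world applications, ensuring that they meet ethical standards is crucial. Existing AI ethics frameworks emphasize fairness, transparency, and accountability, but often lack actionable evaluation methods. This paper proposes a systematic approach based on the Responsible AI Labs (RAIL) framework, which comprises eight measurable dimensions for assessing the normative behavior of large language models (LLMs). We apply the framework to Anthropic's "Values in the Wild" dataset, which contains more than 308,000 anonymized conversations with Claude and more than 3,000 annotated value expressions. By mapping these values to the RAIL dimensions and computing synthetic scores, we characterize the ethical behavior of LLMs in real-world use.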
Position Paper: Towards Open Complex Human-AI Agents Collaboration System for Problem-Solving and Knowledge Management
Abstract
arXiv:2505.00018v1 Announce Type: new Abstract: This position paper critically surveys a broad spectrum of recent empirical developments on human-AI agents collaboration, highlighting both their technical achievements and persistent gaps. We observe a lack of a unifying theoretical framework that can coherently integrate these varied studies, especially when tackling open-ended, complex tasks. To address this, we propose a novel conceptual architecture: one that systematically interlinks the technical details of multi-agent coordination, knowledge management, cybernetic feedback loops, and higher-level control mechanisms. By mapping existing contributions, from symbolic AI techniques and connectionist LLM-based agents to hybrid organizational practices, onto this proposed framework (Hierarchical Exploration-Exploitation Net), our approach facilitates revision of legacy methods and inspires new work that fuses qualitative and quantitative paradigms. The paper's structure allows it to be read from any section, serving equally as a critical review of technical implementations and as a forward-looking reference for designing or extending human-AI symbioses. Together, these insights offer a stepping stone toward deeper co-evolution of human cognition and AI capability.
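Summary
This position paper critically examines a broad range of recent empirical developments in human-AI agent collaboration, highlighting both technical achievements and persistent gaps. We observe that a theoretical framework capable of unifying these varied studies is currently missing, which is especially apparent for open-ended, complex tasks. To address this, we propose a novel conceptual architecture that systematically interlinks the technical details of multi-agent coordination, knowledge management, cybernetic feedback loops, and higher-level control mechanisms. By mapping existing contributions, from symbolic AI techniques and connectionist LLM-based agents to hybrid organizational practices, onto the proposed framework (Hierarchical Exploration-Exploitation Net), our approach both supports the revision of legacy methods and inspires new work that fuses qualitative and quantitative paradigms. The paper is structured so that it can be read starting from any section, serving both as a critical review of technical implementations and as a forward-looking reference for designing or extending human-AI symbiotic systems. Together, these insights lay a foundation for the deeper co-evolution of human cognition and AI capability.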
Can LLMs Help Improve Analogical Reasoning For Strategic Decisions? Experimental Evidence from Humans and GPT-4
Abstract
arXiv:2505.00603v1 Announce Type: new Abstract: This study investigates whether large language models, specifically GPT-4, can match human capabilities in analogical reasoning within strategic decision-making contexts. Using a novel experimental design involving source-to-target matching, we find that GPT-4 achieves high recall by retrieving all plausible analogies but suffers from low precision, frequently applying incorrect analogies based on superficial similarities. In contrast, human participants exhibit high precision but low recall, selecting fewer analogies yet with stronger causal alignment. These findings advance theory by identifying matching, the evaluative phase of analogical reasoning, as a distinct step that requires accurate causal mapping beyond simple retrieval. While current LLMs are proficient in generating candidate analogies, humans maintain a comparative advantage in recognizing deep structural similarities across domains. Error analysis reveals that AI errors arise from surface-level matching, whereas human errors stem from misinterpretations of causal structure. Taken together, the results suggest a productive division of labor in AI-assisted organizational decision making where LLMs may serve as broad analogy generators, while humans act as critical evaluators, applying the most contextually appropriate analogies to strategic problems.
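Summary
This study examines whether large language models, specifically GPT-4, can match humans in analogical reasoning in strategic decision-making contexts. Using a novel experimental design based on source-to-target matching, we find that GPT-4 achieves high recall by retrieving all plausible analogies but has low precision, often applying analogies incorrectly on the basis of superficial similarity. Human participants, by contrast, show high precision but low recall: they select fewer analogies, but with stronger causal alignment. These findings advance theory by identifying matching, the evaluative phase of analogical reasoning, as a distinct step that requires accurate causal mapping beyond simple retrieval. Although current LLMs are good at generating candidate analogies, humans retain a comparative advantage in recognizing deep structural similarities across domains. Error analysis shows that AI errors stem from surface-level matching, whereas human errors come from misreading causal structure. Taken together, the results point to an effective division of labor in AI-assisted organizational decision making: LLMs can serve as broad analogy generators, while humans act as critical evaluators who apply the most contextually appropriate analogies to strategic problems.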
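The precision/recall contrast described in this abstract can be made concrete with a small Python sketch. The analogy labels and the two selection sets below are invented toy data, not figures from the study.

```python
def precision_recall(selected, correct):
    """Precision and recall of a set of selected analogies against the correct set."""
    selected, correct = set(selected), set(correct)
    true_positives = len(selected & correct)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical ground truth: three source cases are causally valid analogies.
correct = {"A", "B", "C"}

# A retriever that returns every plausible candidate: perfect recall, low precision.
broad_selection = precision_recall({"A", "B", "C", "D", "E", "F"}, correct)

# A cautious evaluator that picks one well-matched case: perfect precision, low recall.
narrow_selection = precision_recall({"A"}, correct)
```

Under these toy numbers the broad selection scores precision 0.5 and recall 1.0, while the narrow one scores precision 1.0 and recall 1/3, which is the shape of the trade-off the abstract attributes to GPT-4 and to human participants respectively.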
Distributed Retrieval-Augmented Generation
Abstract
arXiv:2505.00443v1 Announce Type: new Abstract: As large language models (LLMs) become increasingly adopted on edge devices, Retrieval-Augmented Generation (RAG) is gaining prominence as a solution to address factual deficiencies and hallucinations by integrating external knowledge. However, centralized RAG architectures face significant challenges in data privacy and scalability. For instance, smart healthcare services often rely on collecting sensitive patient data and building a centralized knowledge base to provide better diagnosis and treatment advice, while privacy concerns significantly impede this process. Besides, maintaining a comprehensive and continuously updated knowledge base is costly, particularly in response to regional epidemics and rapidly mutating viruses. To address these challenges, this paper introduces Distributed Retrieval-Augmented Generation (DRAG), a novel framework that improves data privacy by eliminating the need for a centralized knowledge base and restoring data control to owners. DRAG incorporates a Topic-Aware Random Walk (TARW) algorithm that leverages LLMs to extract query topics and facilitate targeted peer discovery within a peer-to-peer network, enabling efficient knowledge retrieval in decentralized environments. Extensive experiments across three diverse datasets and LLMs demonstrate that DRAG with TARW achieves near-centralized RAG performance by using half as many messages as flooding. The code is available at https://github.com/xuchenhao001/DRAG.
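Summary
As large language models (LLMs) are increasingly deployed on edge devices, Retrieval-Augmented Generation (RAG) is gaining prominence as a way to address factual deficiencies and hallucinations by integrating external knowledge. Centralized RAG architectures, however, face major challenges in data privacy and scalability. For example, smart healthcare services typically rely on collecting sensitive patient data and building a centralized knowledge base to provide better diagnosis and treatment advice, and privacy concerns seriously impede this process. Maintaining a comprehensive, continuously updated knowledge base is also costly, especially when responding to regional epidemics and rapidly mutating viruses. To address these challenges, this paper proposes the Distributed Retrieval-Augmented Generation (DRAG) framework, which improves data privacy by removing the need for a centralized knowledge base and returning control of data to its owners. DRAG introduces a Topic-Aware Random Walk (TARW) algorithm that uses LLMs to extract query topics and perform targeted peer discovery in a peer-to-peer network, enabling efficient knowledge retrieval in decentralized environments. Extensive experiments on three different datasets and multiple LLMs show that DRAG with TARW reaches near-centralized RAG performance while using only half as many messages as flooding. The code is open source: https://github.com/xuchenhao001/DRAG.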
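The general idea of a topic-biased random walk over a peer-to-peer overlay can be sketched as follows. This is not the TARW algorithm from the paper, whose details the abstract does not give. It assumes each peer knows which topics its direct neighbours advertise, and the graph, topic labels, and function name are hypothetical.

```python
import random


def topic_aware_walk(neighbors, topics, start, query_topic, ttl=10, seed=0):
    """Walk from `start` until a peer holding `query_topic` is found or `ttl` hops are spent.

    Returns (peer_or_None, messages_sent). Each hop counts as one message.
    """
    rng = random.Random(seed)
    current, messages = start, 0
    visited = {start}
    while True:
        if query_topic in topics[current]:
            return current, messages
        if messages >= ttl:
            return None, messages
        # Prefer unvisited neighbours; fall back to any neighbour if all were seen.
        candidates = [p for p in neighbors[current] if p not in visited] or neighbors[current]
        if not candidates:
            return None, messages
        # Bias the walk: follow a neighbour advertising the topic when one exists,
        # otherwise pick uniformly at random.
        matching = [p for p in candidates if query_topic in topics[p]]
        current = rng.choice(matching or candidates)
        visited.add(current)
        messages += 1


# Hypothetical five-peer overlay with one topic per peer.
neighbors = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"],
             "d": ["b", "c", "e"], "e": ["d"]}
topics = {"a": {"sports"}, "b": {"finance"}, "c": {"cardiology"},
          "d": {"oncology"}, "e": {"virology"}}

found_peer, hops = topic_aware_walk(neighbors, topics, "a", "virology")
missing_peer, spent = topic_aware_walk(neighbors, topics, "a", "astronomy")
```

In this toy overlay the walk reaches the "virology" peer in three hops, whereas flooding would forward the query over every link. The comparison is only illustrative and says nothing about the message savings measured in the paper.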
Combining LLMs with Logic-Based Framework to Explain MCTS
Abstract
arXiv:2505.00610v1 Announce Type: new Abstract: In response to the lack of trust in Artificial Intelligence (AI) for sequential planning, we design a Computational Tree Logic-guided large language model (LLM)-based natural language explanation framework designed for the Monte Carlo Tree Search (MCTS) algorithm. MCTS is often considered challenging to interpret due to the complexity of its search trees, but our framework is flexible enough to handle a wide range of free-form post-hoc queries and knowledge-based inquiries centered around MCTS and the Markov Decision Process (MDP) of the application domain. By transforming user queries into logic and variable statements, our framework ensures that the evidence obtained from the search tree remains factually consistent with the underlying environmental dynamics and any constraints in the actual stochastic control process. We evaluate the framework rigorously through quantitative assessments, where it demonstrates strong performance in terms of accuracy and factual consistency.
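Summary
To address the lack of trust in Artificial Intelligence (AI) for sequential planning, we design a Computational Tree Logic-guided, large language model (LLM)-based natural language explanation framework built specifically for the Monte Carlo Tree Search (MCTS) algorithm. MCTS is generally considered hard to interpret because of the complexity of its search trees, but our framework is flexible enough to handle a wide variety of free-form post-hoc queries and knowledge-based inquiries about MCTS and the Markov Decision Process (MDP) of the application domain. By transforming user queries into logic and variable statements, the framework ensures that evidence obtained from the search tree stays factually consistent with the underlying environmental dynamics and with any constraints in the actual stochastic control process. We validate the framework rigorously through quantitative evaluation, and the results show strong performance in both accuracy and factual consistency.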
Open-Source LLM-Driven Federated Transformer for Predictive IoV Management
Abstract
arXiv:2505.00651v1 Announce Type: new Abstract: The proliferation of connected vehicles within the Internet of Vehicles (IoV) ecosystem presents critical challenges in ensuring scalable, real-time, and privacy-preserving traffic management. Existing centralized IoV solutions often suffer from high latency, limited scalability, and reliance on proprietary Artificial Intelligence (AI) models, creating significant barriers to widespread deployment, particularly in dynamic and privacy-sensitive environments. Meanwhile, integrating Large Language Models (LLMs) in vehicular systems remains underexplored, especially concerning prompt optimization and effective utilization in federated contexts. To address these challenges, we propose the Federated Prompt-Optimized Traffic Transformer (FPoTT), a novel framework that leverages open-source LLMs for predictive IoV management. FPoTT introduces a dynamic prompt optimization mechanism that iteratively refines textual prompts to enhance trajectory prediction. The architecture employs a dual-layer federated learning paradigm, combining lightweight edge models for real-time inference with cloud-based LLMs to retain global intelligence. A Transformer-driven synthetic data generator is incorporated to augment training with diverse, high-fidelity traffic scenarios in the Next Generation Simulation (NGSIM) format. Extensive evaluations demonstrate that FPoTT, utilizing EleutherAI Pythia-1B, achieves 99.86% prediction accuracy on real-world data while maintaining high performance on synthetic datasets. These results underscore the potential of open-source LLMs in enabling secure, adaptive, and scalable IoV management, offering a promising alternative to proprietary solutions in smart mobility ecosystems.
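Summary
The rapid growth of connected vehicles in the Internet of Vehicles (IoV) ecosystem poses critical challenges for scalable, real-time, and privacy-preserving traffic management. Existing centralized IoV solutions commonly suffer from high latency, limited scalability, and reliance on proprietary Artificial Intelligence (AI) models, which creates significant barriers to large-scale deployment, particularly in dynamic and privacy-sensitive environments. Meanwhile, the integration of Large Language Models (LLMs) into vehicular systems remains underexplored, especially with respect to prompt optimization and effective use in federated settings. To address these challenges, we propose the Federated Prompt-Optimized Traffic Transformer (FPoTT), a novel framework that uses open-source LLMs for predictive IoV management. FPoTT introduces a dynamic prompt optimization mechanism that iteratively refines textual prompts to improve trajectory prediction. The architecture adopts a dual-layer federated learning paradigm that combines lightweight edge models for real-time inference with cloud-based LLMs that retain global intelligence, and it integrates a Transformer-based synthetic data generator that augments training with diverse, high-fidelity traffic scenarios in the Next Generation Simulation (NGSIM) format. Extensive experiments show that FPoTT with EleutherAI Pythia-1B achieves 99.86% prediction accuracy on real-world data while maintaining strong performance on synthetic datasets. These results confirm the potential of open-source LLMs for secure, adaptive, and scalable IoV management, offering a promising alternative to proprietary solutions in smart mobility ecosystems.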
LangVAE and LangSpace: Building and Probing for Language Model VAEs
Abstract
arXiv:2505.00004v1 Announce Type: cross Abstract: We present LangVAE, a novel framework for modular construction of variational autoencoders (VAEs) on top of pre-trained large language models (LLMs). Such language model VAEs can encode the knowledge of their pre-trained components into more compact and semantically disentangled representations. The representations obtained in this way can be analysed with the LangVAE companion framework: LangSpace, which implements a collection of probing methods, such as vector traversal and interpolation, disentanglement measures, and cluster visualisations. LangVAE and LangSpace offer a flexible, efficient and scalable way of building and analysing textual representations, with simple integration for models available on the HuggingFace Hub. Additionally, we conducted a set of experiments with different encoder and decoder combinations, as well as annotated inputs, revealing a wide range of interactions across architectural families and sizes w.r.t. generalisation and disentanglement. Our findings demonstrate a promising framework for systematising the experimentation and understanding of textual representations.
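Summary
We present LangVAE, a new framework for the modular construction of variational autoencoders (VAEs) on top of pre-trained large language models (LLMs). Such language model VAEs can encode the knowledge of their pre-trained components into more compact and semantically disentangled representations. The representations obtained this way can be studied with the companion analysis tool LangSpace, which implements probing methods such as vector traversal and interpolation, disentanglement measures, and cluster visualizations. LangVAE and LangSpace provide a flexible, efficient, and scalable way to build and analyze textual representations, with easy integration of models available on the HuggingFace Hub. Through experiments with different encoder-decoder combinations and annotated inputs, we reveal a wide range of interactions across architectural families and sizes with respect to generalization and disentanglement. The findings indicate that the framework is a viable way to systematize experimentation with, and understanding of, textual representations.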
Toward a digital twin of U.S. Congress
Abstract
arXiv:2505.00006v1 Announce Type: cross Abstract: In this paper we provide evidence that a virtual model of U.S. congresspersons based on a collection of language models satisfies the definition of a digital twin. In particular, we introduce and provide high-level descriptions of a daily-updated dataset that contains every Tweet from every U.S. congressperson during their respective terms. We demonstrate that a modern language model equipped with congressperson-specific subsets of this data is capable of producing Tweets that are largely indistinguishable from actual Tweets posted by their physical counterparts. We illustrate how generated Tweets can be used to predict roll-call vote behaviors and to quantify the likelihood of congresspersons crossing party lines, thereby assisting stakeholders in allocating resources and potentially impacting real-world legislative dynamics. We conclude with a discussion of the limitations and important extensions of our analysis.
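Summary
This paper provides empirical evidence that a virtual model of U.S. congresspersons built on a collection of language models satisfies the definition of a digital twin. We introduce and outline a daily-updated dataset that contains every Tweet posted by every U.S. congressperson during their term. We show that a modern language model equipped with congressperson-specific subsets of this data can generate Tweets that closely resemble the real ones. We further illustrate how the generated Tweets can be used to predict roll-call voting behavior and to quantify the likelihood that a congressperson votes across party lines, helping stakeholders allocate resources and potentially affecting real-world legislative dynamics. We conclude by discussing the limitations of the study and important directions for extending it.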
Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models
Abstract
arXiv:2505.00010v1 Announce Type: cross Abstract: Jailbreaking in Large Language Models (LLMs) threatens their safe use in sensitive domains like education by allowing users to bypass ethical safeguards. This study focuses on detecting jailbreaks in 2-Sigma, a clinical education platform that simulates patient interactions using LLMs. We annotated over 2,300 prompts across 158 conversations using four linguistic variables shown to correlate strongly with jailbreak behavior. The extracted features were used to train several predictive models, including Decision Trees, Fuzzy Logic-based classifiers, Boosting methods, and Logistic Regression. Results show that feature-based predictive models consistently outperformed Prompt Engineering, with the Fuzzy Decision Tree achieving the best overall performance. Our findings demonstrate that linguistic-feature-based models are effective and explainable alternatives for jailbreak detection. We suggest future work explore hybrid frameworks that integrate prompt-based flexibility with rule-based robustness for real-time, spectrum-based jailbreak monitoring in educational LLMs.
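Summary
Jailbreaking in Large Language Models (LLMs) lets users bypass ethical safeguards and threatens the safe use of these models in sensitive domains such as education. This study focuses on detecting jailbreaks in 2-Sigma, a clinical education platform that uses LLMs to simulate patient interactions. We annotated more than 2,300 prompts across 158 conversations using four linguistic variables that correlate strongly with jailbreak behavior. The extracted features were used to train several predictive models, including Decision Trees, Fuzzy Logic-based classifiers, Boosting methods, and Logistic Regression. The results show that feature-based predictive models consistently outperform Prompt Engineering, with the Fuzzy Decision Tree achieving the best overall performance. Our study shows that models based on linguistic features are effective and explainable alternatives for jailbreak detection. We suggest that future work explore hybrid frameworks that combine prompt-based flexibility with rule-based robustness for real-time, spectrum-based jailbreak monitoring in educational LLMs.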
Sparks of Tabular Reasoning via Text2SQL Reinforcement Learning
Abstract
arXiv:2505.00016v1 Announce Type: cross Abstract: This work reframes the Text-to-SQL task as a pathway for teaching large language models (LLMs) to reason over and manipulate tabular data--moving beyond the traditional focus on query generation. We propose a two-stage framework that leverages SQL supervision to develop transferable table reasoning capabilities. First, we synthesize detailed chain-of-thought (CoT) traces from real-world SQL queries, providing step-by-step, clause-level supervision that teaches the model how to traverse, filter, and aggregate table fields. Second, we introduce a Group Relative Policy Optimization (GRPO) reinforcement learning objective that connects SQL execution accuracy to generalizable reasoning by encouraging steps that extend beyond task-specific syntax and transfer across datasets. Empirically, our approach improves performance on standard Text-to-SQL benchmarks and achieves substantial gains on reasoning-intensive datasets such as BIRD and CRT-QA, demonstrating enhanced generalization and interpretability. Specifically, the distilled-quantized LLaMA model achieved a 20% increase in accuracy when trained on Text-to-SQL tasks, while Qwen achieved a 5% increase. These results suggest that SQL can serve not only as a target formalism but also as an effective scaffold for learning robust, transferable reasoning over structured data.
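Summary
This work reframes the Text-to-SQL task as a pathway for teaching large language models (LLMs) to reason over and manipulate tabular data, moving beyond the traditional focus on query generation. We propose a two-stage framework that uses SQL supervision to develop transferable table reasoning capabilities. First, we synthesize detailed chain-of-thought (CoT) traces from real-world SQL queries, providing step-by-step, clause-level supervision that teaches the model how to traverse, filter, and aggregate table fields. Second, we introduce a Group Relative Policy Optimization (GRPO) reinforcement learning objective that links SQL execution accuracy to generalizable reasoning by encouraging reasoning steps that go beyond task-specific syntax and transfer across datasets. Experiments show that the approach improves performance on standard Text-to-SQL benchmarks and achieves substantial gains on reasoning-intensive datasets such as BIRD and CRT-QA, demonstrating better generalization and interpretability. Specifically, the distilled-quantized LLaMA model trained on Text-to-SQL tasks improved accuracy by 20%, and Qwen improved by 5%. These results suggest that SQL can serve not only as a target formalism but also as an effective scaffold for learning robust, transferable reasoning over structured data.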
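The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt. A minimal Python sketch of that group-relative normalization follows. The 0/1 reward (standing in for "the generated SQL executed to the expected result") and the use of the population standard deviation are assumptions, since the abstract does not describe the reward or the normalization details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers to one question:
# 1.0 if the generated SQL returned the expected rows, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Successful samples receive a positive advantage and failed ones a negative advantage, so the policy update pushes probability toward the better members of each group without needing a separate learned value function.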
ReCellTy: Domain-specific knowledge graph retrieval-augmented LLMs workflow for single-cell annotation
Abstract
arXiv:2505.00017v1 Announce Type: cross Abstract: To enable precise and fully automated cell type annotation with large language models (LLMs), we developed a graph-structured feature marker database to retrieve entities linked to differential genes for cell reconstruction. We further designed a multi-task workflow to optimize the annotation process. Compared to general-purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across 11 tissue types, while more closely aligning with the cognitive logic of manual annotation.
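Summary
To achieve precise, fully automated cell type annotation with large language models (LLMs), we developed a graph-structured feature marker database that retrieves entities linked to differential genes for cell reconstruction. We further designed a multi-task workflow to optimize the annotation process. Compared with general-purpose LLMs, our method raises human evaluation scores by up to 0.21 and semantic similarity by 6.1% across 11 tissue types, while aligning more closely with the cognitive logic of manual annotation.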
An Empirical Study on Prompt Compression for Large Language Models
Abstract
arXiv:2505.00019v1 Announce Type: cross Abstract: Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, we study six prompt compression methods for LLMs, aiming to reduce prompt length while maintaining LLM response quality. In this paper, we present a comprehensive analysis covering aspects such as generation performance, model hallucinations, efficacy in multimodal tasks, word omission analysis, and more. We evaluate these methods across 13 datasets, including news, scientific articles, commonsense QA, math QA, long-context QA, and VQA datasets. Our experiments reveal that prompt compression has a greater impact on LLM performance in long contexts compared to short ones. In the Longbench evaluation, moderate compression even enhances LLM performance. Our code and data are available at https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression.
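Summary
Prompt engineering enables large language models (LLMs) to perform a variety of tasks. Lengthy prompts, however, significantly increase computational complexity and economic cost. To address this, we study six prompt compression methods for LLMs that aim to shorten prompts while maintaining response quality. The paper presents a comprehensive analysis covering generation performance, model hallucinations, efficacy in multimodal tasks, word omission analysis, and more. We evaluate the methods on 13 datasets, including news, scientific articles, commonsense QA, math QA, long-context QA, and visual question answering datasets. The experiments show that prompt compression affects LLM performance more strongly in long contexts than in short ones. In the Longbench evaluation, moderate compression even improves LLM performance. Code and data are available at https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression.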
Beyond Public Access in LLM Pre-Training Data
Abstract
arXiv:2505.00020v1 Announce Type: cross Abstract: Using a legally obtained dataset of 34 copyrighted O'Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI's large language models were trained on copyrighted content without consent. Our AUROC scores show that GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content (AUROC = 82%), compared to OpenAI's earlier model GPT-3.5 Turbo. In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples. GPT-4o Mini, as a much smaller model, shows no knowledge of public or non-public O'Reilly Media content when tested (AUROC ≈ 50%). Testing multiple models with the same cutoff date helps us account for potential language shifts over time that might bias our findings. These results highlight the urgent need for increased corporate transparency regarding pre-training data sources as a means to develop formal licensing frameworks for AI content training.
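Summary
Using a legally obtained dataset of 34 copyrighted O'Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI's large language models were trained on copyrighted content without permission. Our AUROC scores show that GPT-4o, OpenAI's more recent and more capable model, recognizes paywalled O'Reilly book content strongly (AUROC = 82%) compared with OpenAI's earlier model GPT-3.5 Turbo. GPT-3.5 Turbo, in contrast, shows relatively greater recognition of publicly accessible O'Reilly book samples. GPT-4o Mini, a much smaller model, shows no recognition of either public or non-public O'Reilly Media content when tested (AUROC ≈ 50%). Testing multiple models with the same cutoff date lets us rule out the influence of language shifts over time on the findings. These results highlight the urgent need for companies to be more transparent about pre-training data sources in order to establish formal licensing frameworks for AI content training.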
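The AUROC values quoted here can be read as the probability that a randomly chosen in-training ("member") sample scores higher than a randomly chosen non-member, with ties counted as one half. The Python sketch below computes AUROC that way. The scores are invented, and this is the generic pairwise definition rather than the DE-COP pipeline itself.

```python
def auroc(member_scores, non_member_scores):
    """AUROC as the fraction of (member, non-member) pairs ranked correctly.

    Ties contribute 0.5. Quadratic in the number of samples, which is
    fine for a small illustration.
    """
    if not member_scores or not non_member_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical membership scores (higher means "looks more like training data").
separable = auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
uninformative = auroc([0.5, 0.5], [0.5, 0.5])
```

A score near 0.5, as with the tied toy data above, means the test cannot tell members from non-members, which is how the GPT-4o Mini result in the abstract should be read.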
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Abstract
arXiv:2505.00022v1 Announce Type: cross Abstract: Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and training efficiency. We introduce a German-language dataset curation pipeline that combines heuristic and model-based filtering techniques with synthetic data generation. We use our pipeline to create Aleph-Alpha-GermanWeb, a large-scale German pre-training dataset which draws from: (1) Common Crawl web data, (2) FineWeb2, and (3) synthetically-generated data conditioned on actual, organic web data. We evaluate our dataset by pre-training both a 1B Llama-style model and an 8B tokenizer-free hierarchical autoregressive transformer (HAT). A comparison on German-language benchmarks, including MMMLU, shows significant performance gains of Aleph-Alpha-GermanWeb over FineWeb2 alone. This advantage holds at the 8B scale even when FineWeb2 is enriched by human-curated high-quality data sources such as Wikipedia. Our findings support the growing body of evidence that model-based data curation and synthetic data generation can significantly enhance LLM pre-training datasets.
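Summary
Scaling data quantity is essential for large language models (LLMs), yet recent findings show that improving data quality can significantly boost model performance and training efficiency. We present a German-language dataset curation pipeline that combines heuristic and model-based filtering techniques with synthetic data generation. Using this pipeline, we create Aleph-Alpha-GermanWeb, a large-scale German pre-training dataset drawn from: (1) Common Crawl web data, (2) FineWeb2, and (3) synthetic data generated from actual web data. We evaluate the dataset by pre-training a 1B-parameter Llama-style model and an 8B-parameter tokenizer-free hierarchical autoregressive transformer (HAT). On German-language benchmarks, including MMMLU, Aleph-Alpha-GermanWeb shows significant performance gains over FineWeb2 alone. The advantage holds at the 8B scale even when FineWeb2 is supplemented with human-curated high-quality sources such as Wikipedia. Our findings add to the evidence that model-based data curation and synthetic data generation can significantly improve LLM pre-training datasets.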
CORG: Generating Answers from Complex, Interrelated Contexts
Abstract
arXiv:2505.00023v1 Announce Type: cross Abstract: In a real-world corpus, knowledge frequently recurs across documents but often contains inconsistencies due to ambiguous naming, outdated information, or errors, leading to complex interrelationships between contexts. Previous research has shown that language models struggle with these complexities, typically focusing on single factors in isolation. We classify these relationships into four types: distracting, ambiguous, counterfactual, and duplicated. Our analysis reveals that no single approach effectively addresses all these interrelationships simultaneously. Therefore, we introduce Context Organizer (CORG), a framework that organizes multiple contexts into independently processed groups. This design allows the model to efficiently find all relevant answers while ensuring disambiguation. CORG consists of three key components: a graph constructor, a reranker, and an aggregator. Our results demonstrate that CORG balances performance and efficiency effectively, outperforming existing grouping methods and achieving comparable results to more computationally intensive, single-context approaches.
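Summary
In real-world corpora, knowledge frequently recurs across documents but is often inconsistent because of ambiguous naming, outdated information, or errors, which creates complex interrelationships between contexts. Previous research has shown that language models struggle with this complexity and typically handle only single factors in isolation. We classify these relationships into four types: distracting, ambiguous, counterfactual, and duplicated. Our analysis shows that no existing approach handles all of these interrelationships effectively at the same time. We therefore propose Context Organizer (CORG), a framework that organizes multiple contexts into independently processed groups, so that the model can efficiently find all relevant answers while still disambiguating them. CORG has three core components: a graph constructor, a reranker, and an aggregator. Experimental results show that CORG strikes a good balance between performance and efficiency, outperforming existing grouping methods and achieving results comparable to more computationally intensive single-context approaches.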
Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning
Abstract
arXiv:2505.00024v1 Announce Type: cross Abstract: Enabling large language models with external tools has become a pivotal strategy for extending their functionality beyond text generation tasks. Prior work typically enhances tool-use abilities by either applying supervised fine-tuning (SFT) to enforce tool-call correctness or distilling reasoning traces from stronger models for SFT. However, both approaches fall short, either omitting reasoning entirely or producing imitative reasoning that limits generalization. Inspired by the success of DeepSeek-R1 in eliciting reasoning through rule-based reinforcement learning, we develop the Nemotron-Research-Tool-N1 series of tool-using language models using a similar training paradigm. Instead of restrictively supervising intermediate reasoning traces distilled from stronger models, Nemotron-Research-Tool-N1 is optimized with a binary reward that evaluates only the structural validity and functional correctness of tool invocations. This lightweight supervision allows the model to autonomously internalize reasoning strategies, without the need for annotated reasoning trajectories. Experiments on the BFCL and API-Bank benchmarks show that Nemotron-Research-Tool-N1-7B and Nemotron-Research-Tool-N1-14B, built on Qwen-2.5-7B/14B-Instruct, achieve state-of-the-art results, outperforming GPT-4o on both evaluations.
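Summary
Equipping large language models with external tools has become a key strategy for extending their capabilities beyond text generation. Existing work typically improves tool use in one of two ways: applying supervised fine-tuning (SFT) to enforce the correctness of tool calls, or distilling reasoning traces from stronger models for SFT. Both approaches fall short: the former omits reasoning entirely, and the latter produces imitative reasoning that limits generalization. Inspired by the success of DeepSeek-R1 in eliciting reasoning through rule-based reinforcement learning, we develop the Nemotron-Research-Tool-N1 series of tool-using language models with a similar training paradigm. Rather than strictly supervising intermediate reasoning traces distilled from stronger models, the models are optimized with a binary reward that evaluates only the structural validity and functional correctness of tool invocations. This lightweight supervision lets the model internalize reasoning strategies on its own, without annotated reasoning trajectories. On the BFCL and API-Bank benchmarks, Nemotron-Research-Tool-N1-7B and Nemotron-Research-Tool-N1-14B, built on Qwen-2.5-7B/14B-Instruct, achieve state-of-the-art results and outperform GPT-4o on both evaluations.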
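A binary reward of the kind described here, which checks only that a tool call is well formed and matches the expected call, might look like the following Python sketch. The JSON layout with `name` and `arguments` keys is a hypothetical format chosen for illustration; the abstract does not specify the output format or the exact matching rule used in training.

```python
import json


def tool_call_reward(model_output: str, expected: dict) -> int:
    """Return 1 only if the output is a well-formed tool call that matches `expected`.

    Assumed format (hypothetical): {"name": <str>, "arguments": <object>}.
    """
    # Structural validity: the output must parse as JSON ...
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return 0
    # ... and must be an object with exactly the two expected keys.
    if not isinstance(call, dict) or set(call) != {"name", "arguments"}:
        return 0
    if not isinstance(call["arguments"], dict):
        return 0
    # Functional correctness: same tool, same arguments as the reference call.
    same_tool = call["name"] == expected["name"]
    same_args = call["arguments"] == expected["arguments"]
    return int(same_tool and same_args)


reference = {"name": "get_weather", "arguments": {"city": "Berlin"}}
reward_ok = tool_call_reward('{"name": "get_weather", "arguments": {"city": "Berlin"}}', reference)
reward_wrong_arg = tool_call_reward('{"name": "get_weather", "arguments": {"city": "Paris"}}', reference)
reward_malformed = tool_call_reward("get_weather(Berlin)", reference)
```

Because the reward never inspects the reasoning text, any reasoning the model produces before the call is shaped only indirectly, through whether the final call is valid and correct.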
A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1
Abstract
arXiv:2505.00025v1 Announce Type: cross Abstract: In recent years, despite foundation models like DeepSeek-R1 and ChatGPT demonstrating significant capabilities in general tasks, professional knowledge barriers, computational resource requirements, and deployment environment limitations have severely hindered their application in actual medical scenarios. Addressing these challenges, this paper proposes an efficient lightweight medical vertical large language model architecture method, systematically solving the lightweight problem of medical large models from three dimensions: knowledge acquisition, model compression, and computational optimization. At the knowledge acquisition level, a knowledge transfer pipeline is designed from the fine-tuned DeepSeek-R1-Distill-70B teacher model to the DeepSeek-R1-Distill-7B student model, and Low-Rank Adaptation (LoRA) technology is adopted to precisely adjust key attention layers. At the model compression level, compression techniques including 4-bit weight quantization are implemented while preserving the core representation ability for medical reasoning. At the computational optimization level, inference optimization techniques such as Flash Attention acceleration and continuous batching are integrated, and a professional prompt template system is constructed to adapt to different types of medical problems. Experimental results on medical question-answering datasets show that the method proposed in this paper maintains professional accuracy while reducing memory consumption by 64.7% and inference latency by 12.4%, providing an effective solution for the application of medical large models in resource-constrained environments such as edge computing devices.
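Summary
In recent years, although foundation models such as DeepSeek-R1 and ChatGPT have shown strong capabilities on general tasks, professional knowledge barriers, computational resource requirements, and deployment environment limitations have severely hindered their use in real medical scenarios. To address these challenges, this paper proposes an efficient, lightweight architecture method for medical vertical-domain large language models, systematically tackling the lightweighting of medical large models along three dimensions: knowledge acquisition, model compression, and computational optimization. For knowledge acquisition, a knowledge transfer pipeline is designed from a fine-tuned DeepSeek-R1-Distill-70B teacher model to a DeepSeek-R1-Distill-7B student model, and Low-Rank Adaptation (LoRA) is used to precisely adjust key attention layers. For model compression, techniques including 4-bit weight quantization are applied while preserving the core representational ability for medical reasoning. For computational optimization, inference optimizations such as Flash Attention acceleration and continuous batching are integrated, and a professional prompt template system is built to handle different types of medical questions. Experiments on medical question-answering datasets show that the proposed method maintains professional accuracy while reducing memory consumption by 64.7% and inference latency by 12.4%, offering an effective solution for deploying medical large models in resource-constrained environments such as edge computing devices.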
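To make the 4-bit weight quantization step more concrete, here is a minimal sketch. The abstract does not say which quantization scheme is used, so the symmetric, single-scale scheme and the helper names below are assumptions chosen for simplicity; the sketch does not reproduce the reported memory figures.

```python
def quantize_4bit(weights):
    """Symmetric quantization: map floats to integer codes in [-7, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte (offset by 8 so each code fits in 0..15)."""
    padded = list(codes) + [0] * (len(codes) % 2)  # pad odd-length input with a zero code
    return bytes(((a + 8) << 4) | (b + 8) for a, b in zip(padded[0::2], padded[1::2]))
```

Each weight then occupies half a byte instead of four bytes in float32, plus the cost of storing the scale. Practical schemes usually keep one scale per small group of weights to limit the rounding error.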
Theory of Mind in Large Language Models: Assessment and Enhancement
Abstract
arXiv:2505.00026v1 Announce Type: cross Abstract: Theory of Mind (ToM), the ability to infer and reason about others' mental states, is fundamental to human social intelligence. As Large Language Models (LLMs) become increasingly integrated into daily life, it is crucial to assess and enhance their capacity to interpret and respond to human mental states. In this paper, we review LLMs' ToM capabilities by examining both evaluation benchmarks and the strategies designed to improve them. We focus on widely adopted story-based benchmarks and provide an in-depth analysis of methods aimed at enhancing ToM in LLMs. Furthermore, we outline promising future research directions informed by recent benchmarks and state-of-the-art approaches. Our survey serves as a valuable resource for researchers interested in advancing LLMs' ToM capabilities.
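Summary
Theory of Mind (ToM), the ability to infer and reason about the mental states of others, is a foundation of human social intelligence. As Large Language Models (LLMs) become increasingly integrated into daily life, assessing and improving their ability to understand and respond to human mental states becomes crucial. This paper systematically reviews the ToM capabilities of LLMs by examining evaluation benchmarks and improvement strategies. We focus on widely adopted story-based benchmarks and analyze in depth the methods for enhancing ToM in LLMs. In addition, drawing on the latest benchmarks and state-of-the-art methods, we outline promising directions for future research. This survey offers an important reference for researchers working to advance the ToM capabilities of LLMs.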
Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation
Abstract
arXiv:2505.00028v1 Announce Type: cross Abstract: In recent years, end-to-end speech-to-speech (S2S) dialogue systems have garnered increasing research attention due to their advantages over traditional cascaded systems, including achieving lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these end-to-end systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries, eliminating the need for intermediate speech-to-text conversion via techniques like ASR. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. We will release the code and dataset to support reproducibility and promote further research in this area.
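Summary
In recent years, end-to-end speech-to-speech (S2S) dialogue systems have attracted growing research attention because of their advantages over traditional cascaded systems, including lower latency and more natural integration of nonverbal information such as emotion and speaker identity. However, these end-to-end systems face key challenges, particularly in incorporating external knowledge, a capability usually provided by Retrieval-Augmented Generation (RAG) in text-based large language models. The core difficulty is that the modality gap between input speech and retrieved textual knowledge hinders effective integration. To address this problem, we propose a novel end-to-end RAG framework that retrieves relevant textual knowledge directly from speech queries, without intermediate speech-to-text conversion through techniques such as automatic speech recognition. Experimental results show that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although overall performance still lags behind cascaded models, the framework offers a viable direction for improving knowledge integration in end-to-end S2S systems. We will release the code and dataset to support reproducibility and promote further research in this area.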
Keep the General, Inject the Specific: Structured Dialogue Fine-Tuning for Knowledge Injection without Catastrophic Forgetting
Abstract
arXiv:2505.00029v1 Announce Type: cross Abstract: Large Vision Language Models have demonstrated impressive versatile capabilities through extensive multimodal pre-training, but face significant limitations when incorporating specialized knowledge domains beyond their training distribution. These models struggle with a fundamental dilemma: direct adaptation approaches that inject domain-specific knowledge often trigger catastrophic forgetting of foundational visual-linguistic abilities. We introduce Structured Dialogue Fine-Tuning (SDFT), an approach that effectively injects domain-specific knowledge while minimizing catastrophic forgetting. Drawing inspiration from supervised fine-tuning in LLMs and subject-driven personalization in text-to-image diffusion models, our method employs a three-phase dialogue structure: Foundation Preservation reinforces pre-trained visual-linguistic alignment through caption tasks; Contrastive Disambiguation introduces carefully designed counterfactual examples to maintain semantic boundaries; and Knowledge Specialization embeds specialized information through chain-of-thought reasoning. Experimental results across multiple domains confirm SDFT's effectiveness in balancing specialized knowledge acquisition with general capability retention. Our key contributions include a data-centric dialogue template that balances foundational alignment with targeted knowledge integration, a weighted multi-turn supervision framework, and comprehensive evaluation across diverse knowledge types.
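Summary
Large Vision Language Models show excellent general capabilities thanks to extensive multimodal pre-training, but face significant limitations when incorporating specialized knowledge domains outside their training distribution. These models face a fundamental dilemma: adaptation methods that directly inject domain knowledge often trigger catastrophic forgetting of foundational visual-linguistic abilities. We propose Structured Dialogue Fine-Tuning (SDFT), a method that effectively injects domain knowledge while minimizing catastrophic forgetting. Inspired by supervised fine-tuning of large language models and subject-driven personalization in text-to-image diffusion models, our method uses a three-phase dialogue structure: the Foundation Preservation phase reinforces pre-trained visual-linguistic alignment through caption tasks; the Contrastive Disambiguation phase introduces carefully designed counterfactual examples to maintain semantic boundaries; and the Knowledge Specialization phase embeds specialized knowledge through chain-of-thought reasoning. Experimental results across multiple domains confirm the effectiveness of SDFT in balancing specialized knowledge acquisition against retention of general capabilities. Our core contributions are a data-centric dialogue template that balances foundational alignment with targeted knowledge integration, a weighted multi-turn supervision framework, and a comprehensive evaluation across diverse knowledge types.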
Learning to Plan Before Answering: Self-Teaching LLMs to Learn Abstract Plans for Problem Solving
Abstract
arXiv:2505.00031v1 Announce Type: cross Abstract: In the field of large language model (LLM) post-training, the effectiveness of utilizing synthetic data generated by the LLM itself has been well-presented. However, a key question remains unaddressed: what essential information should such self-generated data encapsulate? Existing approaches only produce step-by-step problem solutions, and fail to capture the abstract meta-knowledge necessary for generalization across similar problems. Drawing insights from cognitive science, where humans employ high-level abstraction to simplify complex problems before delving into specifics, we introduce a novel self-training algorithm: LEarning to Plan before Answering (LEPA). LEPA trains the LLM to formulate anticipatory plans, which serve as abstract meta-knowledge for problem-solving, before engaging with the intricacies of problems. This approach not only outlines the solution generation path but also shields the LLM from the distraction of irrelevant details. During data generation, LEPA first crafts an anticipatory plan based on the problem, and then generates a solution that aligns with both the plan and the problem. LEPA refines the plan through self-reflection, aiming to acquire plans that are instrumental in yielding correct solutions. During model optimization, the LLM is trained to predict both the refined plans and the corresponding solutions. By efficiently extracting and utilizing the anticipatory plans, LEPA demonstrates remarkable superiority over conventional algorithms on various challenging natural language reasoning benchmarks.
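Summary
In large language model (LLM) post-training, using synthetic data generated by the model itself has been shown to be highly effective. However, a key question remains unresolved: what core information should such self-generated data contain? Existing methods produce only step-by-step problem solutions and fail to capture the abstract meta-knowledge needed to generalize across similar problems. Inspired by cognitive science, where humans use high-level abstraction to simplify complex problems before going into details, we propose a new self-training algorithm: LEarning to Plan before Answering (LEPA). The algorithm trains the LLM to formulate an anticipatory plan, which serves as abstract meta-knowledge for problem solving, before it engages with the intricacies of the problem. This approach not only outlines the path for generating the solution but also shields the model from irrelevant details. During data generation, LEPA first creates an anticipatory plan based on the problem and then generates a solution that is consistent with both the plan and the problem; it refines the plan through self-reflection, aiming to obtain plans that help produce correct solutions. During model optimization, the LLM is trained to predict both the refined plans and the corresponding solutions. By efficiently extracting and using anticipatory plans, LEPA shows a clear advantage over conventional algorithms on several challenging natural language reasoning benchmarks.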
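The data-generation phase described above (plan, solve, reflect, repeat) can be sketched as a small loop. This is only an outline under assumptions: the three callables stand in for LLM calls and are hypothetical, and checking the solution against a known reference answer is an assumed correctness criterion that the abstract does not spell out.

```python
from typing import Callable, Optional, Tuple


def lepa_generate_example(
    problem: str,
    answer: str,
    propose_plan: Callable[[str], str],
    solve_with_plan: Callable[[str, str], str],
    reflect_on_plan: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Optional[Tuple[str, str]]:
    """Return a (plan, solution) training pair, or None if no plan led to the answer."""
    plan = propose_plan(problem)  # anticipatory plan, written before solving
    for _ in range(max_rounds):
        solution = solve_with_plan(problem, plan)
        if solution == answer:  # assumed check against a known reference answer
            return plan, solution
        # Self-reflection: revise the plan in light of the failed attempt.
        plan = reflect_on_plan(problem, plan, solution)
    return None
```

The returned pairs would then serve as training targets, so that the model learns to emit the refined plan first and the solution second.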
MDD-LLM: Towards Accuracy Large Language Models for Major Depressive Disorder Diagnosis
Abstract
arXiv:2505.00032v1 Announce Type: cross Abstract: Major depressive disorder (MDD) impacts more than 300 million people worldwide, highlighting a significant public health issue. However, the uneven distribution of medical resources and the complexity of diagnostic methods have resulted in inadequate attention to this disorder in numerous countries and regions. This paper introduces a high-performance MDD diagnosis tool named MDD-LLM, an AI-driven framework that utilizes fine-tuned large language models (LLMs) and extensive real-world samples to tackle challenges in MDD diagnosis. Specifically, we select 274,348 individual records from the UK Biobank cohort and design a tabular data transformation method to create a large corpus for training and evaluating the proposed approach. To illustrate the advantages of MDD-LLM, we perform comprehensive experiments and provide several comparative analyses against existing model-based solutions across multiple evaluation metrics. Experimental results show that MDD-LLM (70B) achieves an accuracy of 0.8378 and an AUC of 0.8919 (95% CI: 0.8799 - 0.9040), significantly outperforming existing machine learning and deep learning frameworks for MDD diagnosis. Given the limited exploration of LLMs in MDD diagnosis, we examine numerous factors that may influence the performance of our proposed method, such as tabular data transformation techniques and different fine-tuning strategies.
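Summary
Major depressive disorder (MDD) affects more than 300 million people worldwide and has become a major public health issue. However, the uneven distribution of medical resources and the complexity of diagnostic methods mean that the disorder does not receive adequate attention in many countries and regions. This paper presents a high-performance diagnosis tool named MDD-LLM, an AI-driven framework that fine-tunes large language models (LLMs) on large-scale real-world samples to address the challenges of MDD diagnosis. To this end, we select 274,348 individual records from the UK Biobank cohort for training and evaluating the method. Specifically, we design a tabular data transformation method to build a large corpus that supports training and validation of the proposed approach. To demonstrate the advantages of MDD-LLM, we run comprehensive experiments and compare it with existing model-based solutions across multiple evaluation metrics. Experimental results show that MDD-LLM (70B) achieves an accuracy of 0.8378 and an AUC of 0.8919 (95% CI: 0.8799-0.9040), significantly outperforming existing machine learning and deep learning diagnostic frameworks. Because research on LLMs for MDD diagnosis is still limited, we examine several factors that may affect the performance of the method, including tabular data transformation techniques and different fine-tuning strategies.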
Improving Phishing Email Detection Performance of Small Large Language Models
Abstract
arXiv:2505.00034v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated remarkable performance on many natural language processing (NLP) tasks and have been employed in phishing email detection research. However, in current studies, well-performing LLMs typically contain billions or even tens of billions of parameters, requiring enormous computational resources. To reduce computational costs, we investigated the effectiveness of small-parameter LLMs for phishing email detection. These LLMs have around 3 billion parameters and can run on consumer-grade GPUs. However, small LLMs often perform poorly on the phishing email detection task. To address this issue, we designed a set of methods including Prompt Engineering, Explanation Augmented Fine-tuning, and Model Ensemble to improve the phishing email detection capabilities of small LLMs. We validated the effectiveness of our approach through experiments, significantly improving accuracy on the SpamAssassin dataset from around 0.5 for baseline models like Qwen2.5-1.5B-Instruct to 0.976.
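Summary
Large language models (LLMs) have shown excellent performance on many natural language processing (NLP) tasks and have been applied to phishing email detection research. However, the LLMs that perform well in current studies typically contain billions or even tens of billions of parameters and require enormous computational resources. To reduce computational costs, we investigate the effectiveness of small-parameter LLMs for phishing email detection. These LLMs have about 3 billion parameters and can run on consumer-grade GPUs, but small LLMs often perform poorly on the phishing email detection task. To address this problem, we design a set of methods, including prompt engineering, explanation-augmented fine-tuning, and model ensembling, to improve the phishing email detection ability of small LLMs. Experiments confirm that our approach significantly improves performance, raising accuracy on the SpamAssassin dataset from about 0.5 for baseline models such as Qwen2.5-1.5B-Instruct to 0.976.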
Fact-Consistency Evaluation of Text-to-SQL Generation for Business Intelligence Using Exaone 3.5
Abstract
arXiv:2505.00060v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown promise in enabling natural language interfaces for structured data querying through text-to-SQL generation. However, their application in real-world Business Intelligence (BI) contexts remains limited due to semantic hallucinations, structural errors, and a lack of domain-specific evaluation frameworks. In this study, we propose a Fact-Consistency Evaluation Framework for assessing the semantic accuracy of LLM-generated SQL outputs using Exaone 3.5--an instruction-tuned, bilingual LLM optimized for enterprise tasks. We construct a domain-specific benchmark comprising 219 natural language business questions across five SQL complexity levels, derived from actual sales data in LG Electronics' internal BigQuery environment. Each question is paired with a gold-standard SQL query and a validated ground-truth answer. We evaluate model performance using answer accuracy, execution success rate, semantic error rate, and non-response rate. Experimental results show that while Exaone 3.5 performs well on simple aggregation tasks (93% accuracy in L1), it exhibits substantial degradation in arithmetic reasoning (4% accuracy in H1) and grouped ranking tasks (31% in H4), with semantic errors and non-responses concentrated in complex cases. Qualitative error analysis further identifies common failure types such as misapplied arithmetic logic, incomplete filtering, and incorrect grouping operations. Our findings highlight the current limitations of LLMs in business-critical environments and underscore the need for fact-consistency validation layers and hybrid reasoning approaches. This work contributes a reproducible benchmark and evaluation methodology for advancing reliable natural language interfaces to structured enterprise data systems.
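Summary
Large Language Models (LLMs) show promise for natural language interfaces to structured data querying through text-to-SQL generation. However, their use in real-world Business Intelligence (BI) settings remains limited by semantic hallucinations, structural errors, and the lack of domain-specific evaluation frameworks. This study proposes a Fact-Consistency Evaluation Framework that assesses the semantic accuracy of LLM-generated SQL output using Exaone 3.5, an instruction-tuned bilingual model optimized for enterprise tasks. We build a domain-specific benchmark of 219 natural language business questions across five SQL complexity levels, derived from actual sales records in LG Electronics' internal BigQuery environment. Each question is paired with a gold-standard SQL query and a validated ground-truth answer. Model performance is evaluated using answer accuracy, execution success rate, semantic error rate, and non-response rate. Experimental results show that Exaone 3.5 performs well on simple aggregation tasks (93% accuracy at L1) but degrades substantially on arithmetic reasoning (4% accuracy at H1) and grouped ranking tasks (31% at H4), with semantic errors and non-responses concentrated in complex cases. Qualitative error analysis further identifies common failure types such as misapplied arithmetic logic, incomplete filtering conditions, and incorrect grouping operations. The study reveals the current limitations of LLMs in business-critical environments and underscores the need for fact-consistency validation layers and hybrid reasoning approaches. This work contributes a reproducible benchmark and evaluation methodology for advancing reliable natural language interfaces to structured enterprise data systems.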
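The four metrics named above can be computed with a small harness like the one below. This is a sketch under assumptions: it uses SQLite rather than BigQuery, and the metric definitions (for example, treating an executed query with a wrong result as a semantic error) are one plausible reading, since the abstract does not define them precisely.

```python
import sqlite3


def evaluate(conn, cases):
    """Score (predicted_sql_or_None, gold_sql) pairs; each rate is a fraction of all cases."""
    counts = {
        "answer_accuracy": 0,
        "execution_success_rate": 0,
        "semantic_error_rate": 0,
        "non_response_rate": 0,
    }
    for predicted, gold in cases:
        if not predicted:
            counts["non_response_rate"] += 1  # the model returned no query
            continue
        try:
            got = conn.execute(predicted).fetchall()
        except sqlite3.Error:
            continue  # structural error: the query did not run
        counts["execution_success_rate"] += 1
        expected = conn.execute(gold).fetchall()
        # Compare result sets while ignoring row order.
        if sorted(got, key=repr) == sorted(expected, key=repr):
            counts["answer_accuracy"] += 1
        else:
            counts["semantic_error_rate"] += 1  # ran, but returned the wrong answer
    total = len(cases)
    return {name: value / total for name, value in counts.items()}
```

Comparing executed results instead of SQL strings means two differently written but equivalent queries both count as correct.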
CoordField: Coordination Field for Agentic UAV Task Allocation In Low-altitude Urban Scenarios
Abstract
arXiv:2505.00091v1 Announce Type: cross Abstract: With the increasing demand for heterogeneous Unmanned Aerial Vehicle (UAV) swarms to perform complex tasks in urban environments, system design now faces major challenges, including efficient semantic understanding, flexible task planning, and the ability to dynamically adjust coordination strategies in response to evolving environmental conditions and continuously changing task requirements. To address the limitations of existing approaches, this paper proposes a coordination field agentic system for coordinating heterogeneous UAV swarms in complex urban scenarios. In this system, large language models (LLMs) are responsible for interpreting high-level human instructions and converting them into executable commands for the UAV swarms, such as patrol and target tracking. Subsequently, a coordination field mechanism is proposed to guide UAV motion and task selection, enabling decentralized and adaptive allocation of emergent tasks. A total of 50 rounds of comparative testing were conducted across different models in a 2D simulation space to evaluate their performance. Experimental results demonstrate that the proposed system achieves superior performance in terms of task coverage, response time, and adaptability to dynamic changes.
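Summary
As demand grows for heterogeneous Unmanned Aerial Vehicle (UAV) swarms that perform complex tasks in urban environments, system design faces major challenges, including efficient semantic understanding, flexible task planning, and the ability to adjust coordination strategies dynamically as environmental conditions evolve and task requirements keep changing. To address the limitations of existing approaches, this paper proposes a coordination field agentic system for coordinating heterogeneous UAV swarms in complex urban scenarios. The system uses large language models (LLMs) to interpret high-level human instructions and convert them into executable commands for the UAV swarm, such as patrol and target tracking. It then introduces a coordination field mechanism that guides UAV motion and task selection, enabling decentralized, adaptive allocation of emergent tasks. A total of 50 rounds of comparative testing across different models were run in a 2D simulation space to evaluate performance. Experimental results show that the proposed system achieves superior task coverage, response time, and adaptability to dynamic changes.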
Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques
Abstract
arXiv:2505.00105v1 Announce Type: cross Abstract: Retrieval-Augmented Generation enhances language models by retrieving relevant information from external knowledge bases, relying on high-dimensional vector embeddings typically stored in float32 precision. However, storing these embeddings at scale presents significant memory challenges. To address this issue, we systematically investigate two complementary optimization strategies on the MTEB benchmark: quantization, evaluating standard formats (float16, int8, binary) and low-bit floating-point types (float8), and dimensionality reduction, assessing methods like PCA, Kernel PCA, UMAP, Random Projections and Autoencoders. Our results show that float8 quantization achieves a 4x storage reduction with minimal performance degradation (<0.3%), significantly outperforming int8 quantization at the same compression level while being simpler to implement. PCA emerges as the most effective dimensionality reduction technique. Crucially, combining moderate PCA (e.g., retaining 50% of dimensions) with float8 quantization offers an excellent trade-off, achieving 8x total compression with less performance impact than using int8 alone (which provides only 4x compression). To facilitate practical application, we propose a methodology based on visualizing the performance-storage trade-off space to identify the optimal configuration that maximizes performance within specific memory constraints.
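Summary
Retrieval-Augmented Generation enhances language models by retrieving relevant information from external knowledge bases, and it relies on high-dimensional vector embeddings that are usually stored in float32 precision. Storing these embeddings at scale, however, poses significant memory challenges. To address this problem, we systematically study two complementary optimization strategies on the MTEB benchmark: quantization (evaluating standard formats such as float16, int8, and binary, as well as the low-bit floating-point type float8) and dimensionality reduction (evaluating methods such as PCA, Kernel PCA, UMAP, Random Projections, and Autoencoders). The results show that float8 quantization achieves 4x storage compression with minimal performance loss (<0.3%), significantly outperforming int8 quantization at the same compression level while being simpler to implement. PCA proves to be the most effective dimensionality reduction technique. A key finding is that combining moderate PCA (for example, retaining 50% of dimensions) with float8 quantization offers an excellent trade-off, achieving 8x total compression with even less performance impact than int8 alone (which provides only 4x compression). To support practical use, we propose a methodology based on visualizing the performance-storage trade-off space to identify the configuration that maximizes performance under specific memory constraints.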
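The compression factors quoted above follow from simple storage arithmetic, sketched below. The embedding dimension of 1024 used in the usage note is an arbitrary illustrative value, and the calculation ignores side costs such as the PCA projection matrix or any quantization scale factors.

```python
# Storage cost per embedding component, in bytes.
BYTES_PER_VALUE = {
    "float32": 4.0,
    "float16": 2.0,
    "float8": 1.0,
    "int8": 1.0,
    "binary": 1 / 8,  # one bit per component
}


def bytes_per_vector(dims, dtype="float32", keep_fraction=1.0):
    """Storage for one embedding after keeping a fraction of its dimensions."""
    kept_dims = int(dims * keep_fraction)
    return kept_dims * BYTES_PER_VALUE[dtype]


def compression_ratio(dims, dtype, keep_fraction=1.0):
    """Size of the float32 baseline divided by the size of the compressed vector."""
    return bytes_per_vector(dims, "float32") / bytes_per_vector(dims, dtype, keep_fraction)
```

For a 1024-dimensional embedding, `compression_ratio(1024, "float8")` gives 4.0 and `compression_ratio(1024, "float8", keep_fraction=0.5)` gives 8.0, matching the 4x and 8x figures in the abstract.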